Inference at Scale with Apache Beam 


Danny McCormick 
docs.google.com/presentation/d/1JJiLxXEPJgxspDsVpWvnccaGOU203xVY PRdZZ60mWpE 
(or shorturl.at/gxZ38) 


1. Beam History + Overview
2. Basic Inference
3. Problems/Solutions
   ◦ Model Freshness
   ◦ Large Models
   ◦ Specialty Hardware
4. Where Next?


Data got big: 
And never-ending!


Unified Model for Batch and Streaming


• Batch processing is a special case of stream processing
• Batch + Stream = Beam


Build your pipeline in whatever language(s) you want... 


Group by Key 


... with whatever execution engine you want 


Terms


• PCollection - distributed multi-element dataset

• Transform - operation that takes N PCollections and produces M PCollections

• Pipeline - directed acyclic graph of Transforms and PCollections
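
To make these terms concrete, here is a minimal pure-Python sketch of how a pipe-style API can chain transforms over a collection. It does not use Beam: `ToyPCollection` and `ToyMap` are hypothetical names invented for illustration, and a real PCollection is distributed and lazily evaluated rather than an in-memory list.

```python
# Toy illustration only: hypothetical classes, not Apache Beam's implementation.


class ToyPCollection:
    """An in-memory stand-in for a PCollection (the real one is distributed)."""

    def __init__(self, elements):
        self.elements = list(elements)

    def __or__(self, transform):
        # `pcoll | transform` applies the transform and returns a new collection.
        return transform.expand(self)


class ToyMap:
    """A transform that applies a function to every element."""

    def __init__(self, fn):
        self.fn = fn

    def expand(self, pcoll):
        return ToyPCollection(self.fn(x) for x in pcoll.elements)


def add_one(element):
    return element + 1


# A two-step "pipeline": each `|` adds one node to the chain.
result = ToyPCollection([1, 2, 3]) | ToyMap(add_one) | ToyMap(lambda x: x * 10)
print(result.elements)
```

Each `|` here produces a new collection from an old one, which is the one-in, one-out case of the "N PCollections in, M PCollections out" definition.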


Basic Beam Graph 


[Diagram: a pipeline graph with two Source transforms, a Map transform, a Combine transform, and three Sink transforms]


Basic Beam Pipeline 


import apache_beam as beam


def add_one(element):
    # ReadFromText yields each line as a string, so convert before adding.
    return int(element) + 1


with beam.Pipeline() as pipeline:
    (
        pipeline
        | beam.io.ReadFromText('gs://some/inputData.txt')
        | beam.Map(add_one)
        | beam.io.WriteToText('gs://some/outputData')
    )


[Diagram: Read text file -> Map Transform -> Write to text file]


[Diagram: the ML workflow cycle, with a legend distinguishing workflows in Beam from workflows outside of Beam. Stages shown: New data, Data Validation, Data Preprocessing, Model Validation, Model Deployment, Model iterations.]


Challenges of Distributed Inference 


• Efficiently loading models

• Batching

• Model Updates

• Using multiple models
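
As a rough illustration of the batching challenge, the sketch below groups incoming elements into fixed-size batches so that a model is called once per batch rather than once per element. It is a standard-library sketch under stated assumptions: `batch_elements` and `fake_model` are hypothetical names, and a production system would typically also tune batch size dynamically.

```python
from itertools import islice


def batch_elements(elements, max_batch_size):
    """Yield lists of up to max_batch_size consecutive elements."""
    iterator = iter(elements)
    while True:
        batch = list(islice(iterator, max_batch_size))
        if not batch:
            return
        yield batch


def fake_model(batch):
    # Hypothetical stand-in for a real model's batched predict call.
    return [2 * x for x in batch]


predictions = []
for batch in batch_elements(range(7), max_batch_size=3):
    # One model call per batch instead of one per element.
    predictions.extend(fake_model(batch))

print(predictions)
```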


Distributed Inference with Beam 


• Beam takes care of all of this with the RunInference transform

• Loads the model, batches inputs, handles updates, and plugs into the DAG


RunInference(model_handler=<config>)


RunInference


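>>> import apache_beam as beam
>>> import numpy
>>> import torch
>>> from apache_beam.ml.inference.base import RunInference
>>> from apache_beam.ml.inference.pytorch_inference import PytorchModelHandlerTensor

>>> # LinearRegression is a user-defined torch.nn.Module (not shown here).
>>> data = numpy.array([10, 40, 60, 90],
...                    dtype=numpy.float32).reshape(-1, 1)

>>> model_handler = PytorchModelHandlerTensor(
...     model_class=LinearRegression,
...     model_params={'input_dim': 1, 'output_dim': 1},
...     state_dict_path='gs://path/to/model.pt')

>>> with beam.Pipeline() as p:
...     predictions = (
...         p
...         | beam.Create(data)
...         | beam.Map(torch.Tensor)  # Map np array to Tensor
...         | RunInference(model_handler=model_handler)
...         | beam.Map(print))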


Basic Inference Demo 


colab.sandbox.google.com/github/apache/beam/blob/master/examples/notebooks/beam-ml/run_inference_huggingface.ipynb
(shorturl.at/brvN9) 


You've deployed your model! Now what? 


• New data
• New training algorithms
• New models


• Stop and start your pipeline
• Pipeline drain/update
• Automatic model refresh


Automatic Model Refresh 


• Hot swaps the model in a live pipeline

• Manages memory for you

• No pipeline downtime (though maybe some inference downtime)


Automatic Model Refresh 


side_input_pcoll = (pipeline
    | "WatchFilePattern" >> WatchFilePattern(file_pattern=file_pattern,
                                             interval=side_input_fire_interval,
                                             stop_timestamp=end_timestamp))


inferences = (image_data
    | "ApplyWindowing" >> beam.WindowInto(beam.window.FixedWindows(10))
    | "RunInference" >> RunInference(model_handler=model_handler,
                                     model_metadata_pcoll=side_input_pcoll))
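
The general idea of a hot swap can be sketched without Beam: inference threads read a reference to the current model, and an updater replaces that reference when a newer model becomes available. The sketch below is a conceptual illustration only; `ModelHolder`, `model_v1`, and `model_v2` are hypothetical names, and it does not reflect how RunInference implements refresh internally.

```python
import threading


class ModelHolder:
    """Holds the current model and lets an updater swap it at any time."""

    def __init__(self, model):
        self._model = model
        self._lock = threading.Lock()

    def swap(self, new_model):
        # Replace the reference; the old model can be garbage collected
        # once no in-flight prediction still holds it.
        with self._lock:
            self._model = new_model

    def predict(self, x):
        # Take the current reference under the lock, then run inference
        # outside it so a slow prediction does not block a swap.
        with self._lock:
            model = self._model
        return model(x)


def model_v1(x):
    return x + 1


def model_v2(x):
    return x * 100


holder = ModelHolder(model_v1)
before = holder.predict(2)   # served by model_v1
holder.swap(model_v2)        # e.g. triggered when a new model file appears
after = holder.predict(2)    # served by model_v2
print(before, after)
```

The pipeline itself never stops in this picture; only the model reference changes between two predictions.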


Default configuration: share model across threads


• Aka the easy case


>>> model_handler = PytorchModelHandlerTensor(
...     model_class=LinearRegression,
...     model_params={'input_dim': 1, 'output_dim': 1},
...     state_dict_path='gs://path/to/model.pt')


>>> pcoll | RunInference(model_handler=model_handler)
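
One way to picture this default is a model that is loaded once and then shared by every worker thread in the process. The sketch below uses only the Python standard library; `get_shared_model` and the lambda "model" are hypothetical stand-ins, and this is not a description of Beam's internal sharing mechanism.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

_load_lock = threading.Lock()
_shared_model = None
load_calls = 0


def get_shared_model():
    """Load the model on first use, then return the same object to every thread."""
    global _shared_model, load_calls
    with _load_lock:
        if _shared_model is None:
            load_calls += 1                   # count loads to show it happens once
            _shared_model = lambda x: x * 2   # hypothetical stand-in for a real model
        return _shared_model


def infer(x):
    return get_shared_model()(x)


with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(infer, range(8)))

print(results, load_calls)
```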


Optional: serve a single model for all processes


• Reduces memory use at the cost of interprocess communication and reduced parallelism


>>> model_handler = PytorchModelHandlerTensor(
...     model_class=LinearRegression,
...     large_model=True,
...     model_params={'input_dim': 1, 'output_dim': 1},
...     state_dict_path='gs://path/to/model.pt')


>>> pcoll | RunInference(model_handler=model_handler)


Optional: serve a single model for all processes


• The Model Manager is empowered to load and unload models in order to make optimal use of memory


>>> per_key_mhs = [
...     KeyModelMapping(['key1', 'key2', 'key3'], model_handler_1),
...     KeyModelMapping(['foo', 'bar', 'baz'], model_handler_2)]
>>> mh = KeyedModelHandler(per_key_mhs)


>>> pcoll | RunInference(model_handler=mh)
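
The load/unload behavior can be pictured as a small cache that keeps only a bounded number of models in memory and evicts the least recently used one when space is needed. The sketch below is an assumption-laden illustration in plain Python: `ToyModelManager`, its least-recently-used policy, and the lambda "models" are all hypothetical, and Beam's actual Model Manager may use a different strategy.

```python
from collections import OrderedDict


class ToyModelManager:
    """Keeps at most max_loaded models in memory, evicting the least recently used."""

    def __init__(self, loaders, max_loaded):
        self._loaders = loaders        # model name -> zero-argument load function
        self._max_loaded = max_loaded
        self._loaded = OrderedDict()   # model name -> loaded model, oldest first
        self.load_count = 0

    def get(self, name):
        if name in self._loaded:
            self._loaded.move_to_end(name)     # mark as most recently used
            return self._loaded[name]
        if len(self._loaded) >= self._max_loaded:
            self._loaded.popitem(last=False)   # unload the least recently used model
        self.load_count += 1
        self._loaded[name] = self._loaders[name]()
        return self._loaded[name]

    def loaded_names(self):
        return list(self._loaded)


# Hypothetical loaders; real ones would read model weights from storage.
loaders = {
    "small": lambda: (lambda x: x + 1),
    "medium": lambda: (lambda x: x * 2),
    "large": lambda: (lambda x: x ** 2),
}
manager = ToyModelManager(loaders, max_loaded=2)

outputs = [manager.get(name)(3) for name in ["small", "medium", "small", "large"]]
print(outputs, manager.loaded_names(), manager.load_count)
```

With room for two models, the third distinct model forces an unload, while the repeated request for an already loaded model costs nothing extra.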


Large Model Demo 


colab.sandbox.google.com/github/apache/beam/blob/master/examples/notebooks/beam-ml/per_key_models.ipynb
(shorturl.at/pKNU2) 


GPU/TPU Support 


• Hardware availability depends on the runner
• Beam has some primitives that help


Beam Primitives for GPUs 


• Resource hints for heterogeneous pools

• Built-in detection + framework-specific responses to GPUs at the ModelHandler level

• Large model setting (revisited)


Where next?


(opportunities I see, not representative of the whole community)


Inference Space 


• More frameworks
• Better performance testing/profiling
• Model Manager improvements


Beyond Inference 


• MLTransform for data prep and pre/postprocessing

• Feature Store Enrichment

• Higher-level ML support (e.g. anomaly detection)


Come join our community! 


Questions?
Contact - Danny McCormick (dannymccormick@google.com) 


Slides - https://shorturl.at/jzEQ6 


